Time Series Forecasting

Step 01 Introduction

As a data science team, we are working with a real estate company in Beijing, China. We have been asked to determine whether the company's sales are increasing or declining.

This Time Series Forecasting project is a companion to the Beijing housing price project, focusing on forecasting instead of regression.

Data

The housing price data is carried over from Step 3 of the other project. In addition, we use several market prices and indices to improve forecasting accuracy.

Step 02 Data Wrangling

Imports

In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import scale

from library.sb_utils import save_file

%matplotlib inline

# Optional Plotly Method Imports
import plotly
import cufflinks as cf
cf.go_offline()

# These are for time series statistics
from statsmodels.tsa.stattools import adfuller,kpss,coint,bds,q_stat,grangercausalitytests,levinson_durbin
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf # for determining (p,q) orders
from statsmodels.tsa.seasonal import seasonal_decompose      # for ETS Plots
from pmdarima import auto_arima                              # for determining ARIMA orders


from statsmodels.tools.eval_measures import rmse

Configuration

In [3]:
# Set the number of rows/columns to display
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

pd.options.display.float_format = '{:.4f}'.format

Load The Data

In [4]:
df = pd.read_csv('../data/df_data_step3a_EDA.csv',parse_dates = ['tradeTime'])
df_sales = pd.read_csv('../data/time_sales_data_step3a_EDA.csv',index_col='tradeTime',parse_dates = ['tradeTime'],)

df_10yrBond = pd.read_csv('../raw_data/China 10-Year Bond Yield Historical Data.csv',index_col='Date',parse_dates = True)
df_hmi  = pd.read_csv('../raw_data/Real Residential Property Prices for China.csv',index_col='DATE',parse_dates = True)
df_sse  = pd.read_csv('../raw_data/SSE Composite Index (000001.SS) Historical Data - Yahoo Finance.csv',index_col='Date',parse_dates = True)

Clean the data

China 10-Year Bond Yield Historical Data

Data overview

In [5]:
df_10yrBond.head()
Out[5]:
Price Open High Low Change %
Date
Jan 05 4.6470 4.6470 4.6470 4.6470 2.51%
Feb 05 4.6720 4.6720 4.6720 4.6720 0.54%
Mar 05 4.4110 4.4110 4.4110 4.4110 -5.59%
Apr 05 4.3570 4.3570 4.3570 4.3570 -1.22%
May 05 4.1330 4.1330 4.1330 4.1330 -5.14%

There are some unnecessary columns.

Type of index

In [6]:
df_10yrBond.index
Out[6]:
Index(['Jan 05', 'Feb 05', 'Mar 05', 'Apr 05', 'May 05', 'Jun 05', 'Jul 05',
       'Aug 05', 'Sep 05', 'Oct 05',
       ...
       'Feb 21', 'Mar 21', 'Apr 21', 'May 21', 'Jun 21', 'Jul 21', 'Aug 21',
       'Sep 21', 'Oct 21', 'Nov 21'],
      dtype='object', name='Date', length=203)

The data is monthly ('MS', month start), but the index is currently stored as plain strings.

Clean data and rename the columns

We keep only the date and the price of the 10-year bond. The dates are at monthly frequency, so the index is renamed to 'month'.

In [7]:
df_10yrBond.rename(columns={'Price':'10yrBond',},inplace=True)
df_10yrBond.drop(columns=[ 'Open', 'High', 'Low', 'Change %'],inplace=True)
df_10yrBond.index.names=['month']
# df_10yrBond.index.astype('datetime64[ns]')
In [8]:
df_10yrBond.index
Out[8]:
Index(['Jan 05', 'Feb 05', 'Mar 05', 'Apr 05', 'May 05', 'Jun 05', 'Jul 05',
       'Aug 05', 'Sep 05', 'Oct 05',
       ...
       'Feb 21', 'Mar 21', 'Apr 21', 'May 21', 'Jun 21', 'Jul 21', 'Aug 21',
       'Sep 21', 'Oct 21', 'Nov 21'],
      dtype='object', name='month', length=203)

Now, we transform the index format to the same one as the tradeTime in the original housing price data.

In [9]:
# https://stackoverflow.com/questions/58490016/convert-mmm-yy-to-date-time-in-python
df_10yrBond.index = pd.to_datetime(df_10yrBond.index,format='%b %y').strftime('20%y-%m-01')
In [10]:
df_10yrBond.index = pd.to_datetime(df_10yrBond.index)

Make a new dataset, called market, to use in the next steps

In [11]:
market = df_10yrBond.copy()

# set index frequency
market.index.freq='MS'

# Visually check the data
market.plot();

China Housing Market Index

Data overview

In [12]:
# Check the data
df_hmi.head()
Out[12]:
QCNR628BIS
DATE
2005-04-01 87.9389
2005-07-01 89.5385
2005-10-01 89.7531
2006-01-01 88.5259
2006-04-01 90.6989

Rename the columns

In [13]:
# Rename the column to hmi (housing market index)
df_hmi.rename(columns={'QCNR628BIS':'hmi'},inplace=True)
df_hmi.index.names=['quarter']
In [14]:
df_hmi.head()
Out[14]:
hmi
quarter
2005-04-01 87.9389
2005-07-01 89.5385
2005-10-01 89.7531
2006-01-01 88.5259
2006-04-01 90.6989

Add to market dataset

In [15]:
# Add to market and reset the index name
market = pd.concat([market, df_hmi],axis=1)
market.index.names=['MS']

The original hmi is quarterly. To convert it to monthly data, the ffill method of the fillna function is applied.

In [16]:
market.hmi = market.hmi.fillna(method='ffill')
market.head()
Out[16]:
10yrBond hmi
MS
2005-01-01 4.6470 NaN
2005-02-01 4.6720 NaN
2005-03-01 4.4110 NaN
2005-04-01 4.3570 87.9389
2005-05-01 4.1330 87.9389
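
As a side note, roughly the same monthly series could be produced directly from the quarterly frame with pandas resample before the concat (within the hmi date range); a small sketch, with hmi_monthly as an illustrative name:

# Sketch: upsample the quarterly hmi to month-start ('MS') frequency and
# forward-fill, instead of concatenating first and filling afterwards.
hmi_monthly = df_hmi.resample('MS').ffill()
hmi_monthly.head()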
In [17]:
# Check the graph
market['10yrBond'].plot();

SSE Composite Index (000001.SS) from Yahoo Finance

The SSE (Shanghai Stock Exchange) Composite Index is one of the major stock indices in China.

Data overview

In [18]:
# Check the contents in the SSE data.
df_sse.head()
Out[18]:
Open High Low Close Adj Close Volume
Date
2005-01-01 1260.7820 1268.8560 1189.2111 1191.8230 1191.8230 197200
2005-02-01 1189.5010 1328.5330 1187.2600 1306.0031 1306.0031 223200
2005-03-01 1305.2490 1326.0760 1162.0310 1181.2360 1181.2360 310600
2005-04-01 1180.3730 1254.3190 1135.7330 1159.1460 1159.1460 319000
2005-05-01 1160.6200 1165.3900 1043.2760 1060.7380 1060.7380 181000

There are some unnecessary columns, which will be removed.

In [19]:
# Drop unnecessary columns
df_sse.drop(columns=[ 'Open', 'High', 'Low','Adj Close', 'Volume'],inplace=True)

# Rename the columns
df_sse.index.names=['month']
df_sse.rename(columns={'Close':'sse'},inplace=True)
In [20]:
df_sse.head()
Out[20]:
sse
month
2005-01-01 1191.8230
2005-02-01 1306.0031
2005-03-01 1181.2360
2005-04-01 1159.1460
2005-05-01 1060.7380

Add to market dataset

In [21]:
market = pd.concat([market, df_sse],axis=1)
market.head()
Out[21]:
10yrBond hmi sse
2005-01-01 4.6470 NaN 1191.8230
2005-02-01 4.6720 NaN 1306.0031
2005-03-01 4.4110 NaN 1181.2360
2005-04-01 4.3570 87.9389 1159.1460
2005-05-01 4.1330 87.9389 1060.7380
In [22]:
market.info();
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 203 entries, 2005-01-01 to 2021-11-01
Freq: MS
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   10yrBond  203 non-null    float64
 1   hmi       200 non-null    float64
 2   sse       191 non-null    float64
dtypes: float64(3)
memory usage: 6.3 KB
In [23]:
# Check the graph
market.sse.plot();

We will smooth this sse series in the same way as the 10-year bond.
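
The smoothing itself is applied in a later step; purely as an illustration, one common choice is a rolling mean (the 12-month window below is an assumption, not necessarily the method used later):

# Illustrative sketch only: a 12-month rolling mean as one way to smooth a
# monthly series; the window length is an assumption.
sse_smooth = market['sse'].rolling(window=12, min_periods=1).mean()
ax = market['sse'].plot(figsize=(12, 4), label='sse')
sse_smooth.plot(ax=ax, label='sse (12-month rolling mean)')
ax.legend();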

Total Sales data

This comes from the Step 3 EDA output of the Regression project.

Add to market dataset

In [24]:
market = pd.concat([market, df_sales[['totalSales']]],axis=1)
In [25]:
# Check the graph
market['totalSales'].plot();

Step 03 Explore The Data

Time series data

Here, we visually check the relationship between totalSales and the other market series over the sales period extended by two years on each side, i.e. from 2010 (2012-2) to 2019 (2017+2).

In [26]:
# https://stackoverflow.com/questions/20356982/matplotlib-colour-multiple-twinx-axes
def threeTimeSeries(market):
    
    """
    Plot totalSales together with three time series that have different units, each on its own y-axis.
    
    """


    fig, host = plt.subplots(
        figsize=(12, 6)
    )
    fig.subplots_adjust(
#         right=0.75
    )

    par1 = host.twinx()
    par2 = host.twinx()
    par3 = host.twinx()

    # move the spine of the second axes outwards
    par2.spines["right"].set_position(("axes", 1.1))
    par3.spines["right"].set_position(("axes", 1.2))
    

    p1, = host.plot(market.index, market.totalSales, 'r-', label="Total Sales")
    p2, = par1.plot(market.index, market['10yrBond'], 'g-', label="China 10-Year Bond")
    p3, = par2.plot(market.index, market.hmi, 'b-', label="China Housing Market Index")
    p4, = par3.plot(market.index, market['sse'], 'y-', label="SSE Composite Index")

    host.set_xlabel("Time")
    host.set_ylabel("Total Sales")
    par1.set_ylabel("China 10-Year Bond")
    par2.set_ylabel("China Housing Market Index")
    par3.set_ylabel("SSE Composite Index")

#     host.set_xlim(df_10yrBond.index.min(),df_10yrBond.index.max())

    lines = [p1, p2, p3, p4]
    host.legend(lines, [l.get_label() for l in lines],loc='upper left',)

    for ax in [par1, par2, par3]:
        ax.set_frame_on(True)
        ax.patch.set_visible(False)

        plt.setp(ax.spines.values(), visible=False)
        ax.spines["right"].set_visible(True)

    host.yaxis.label.set_color(p1.get_color())
    par1.yaxis.label.set_color(p2.get_color())
    par2.yaxis.label.set_color(p3.get_color())
    par3.yaxis.label.set_color(p4.get_color())

    host.spines["left"].set_edgecolor(p1.get_color())
    par1.spines["right"].set_edgecolor(p2.get_color())
    par2.spines["right"].set_edgecolor(p3.get_color())
    par3.spines["right"].set_edgecolor(p4.get_color())

    host.tick_params(axis='y', colors=p1.get_color())
    par1.tick_params(axis='y', colors=p2.get_color())
    par2.tick_params(axis='y', colors=p3.get_color())
    par3.tick_params(axis='y', colors=p4.get_color())

    host.spines['top'].set_visible(False)

    # host.legend(loc, bbox_to_anchor=(0.6,0.5))

    try:
        # Remove frame color
        frame = host.legend_.get_frame()
        frame.set_facecolor('1.0')
        frame.set_edgecolor('1.0')
    except:
        pass
In [27]:
threeTimeSeries(market.loc['2010-01-01':'2019-12-31'])

Total sales appear to be more correlated with the China 10-Year Bond than with the other series.

Seasonal Decomposition

Here, we check the trend and seasonality of the original total sales data, which has not yet been smoothed.

In [28]:
# Import necessary libraries
from statsmodels.tsa.seasonal import seasonal_decompose

def plotseasonal(col):
    
    fig, axes = plt.subplots(ncols=1, nrows=4, sharex=True, figsize=(12,5))
    res = seasonal_decompose(df_sales[col].dropna())
    res.observed.plot(ax=axes[0], legend=False)
    axes[0].set_ylabel('Observed')
    res.trend.plot(ax=axes[1], legend=False)
    axes[1].set_ylabel('Trend')
    res.seasonal.plot(ax=axes[2], legend=False)
    axes[2].set_ylabel('Seasonal')
    res.resid.plot(ax=axes[3], legend=False)
    axes[3].set_ylabel('Residual')
    plt.tight_layout()
    plt.show()
In [29]:
from IPython import display
from ipywidgets import interact, widgets

interact(plotseasonal,
         col=widgets.ToggleButtons(
             options=list(df_sales.columns),
             value='totalSales',
             description='district:',
             disabled=False,
             button_style='', # 'success', 'info', 'warning', 'danger' or ''
            )
        );

Both a trend and seasonality are found.

Tests for Stationarity

We use Augmented Dickey-Fuller Test

In [30]:
from statsmodels.tsa.stattools import adfuller

def adf_test(series,title=''):
    """
    Pass in a time series and an optional title 
    Returns an ADF report
    """
    print(f'Augmented Dickey-Fuller Test: {title}')
    result = adfuller(series.dropna(),autolag='AIC') # .dropna() handles differenced data
    
    labels = ['ADF test statistic','p-value','# lags used','# observations']
    out = pd.Series(result[0:4],index=labels)

    for key,val in result[4].items():
        out[f'critical value ({key})']=val
        
    print(out.to_string())          # .to_string() removes the line "dtype: float64"
    
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis")
        print("Reject the null hypothesis")
        print("Data has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis")
        print("Fail to reject the null hypothesis")
        print("Data has a unit root and is non-stationary")
        
# refer to www.pieriandata.com
In [31]:
adf_test(market['totalSales'].dropna(),title='Augmented Dickey-Fuller Test on totalSales Data')
Augmented Dickey-Fuller Test: Augmented Dickey-Fuller Test on totalSales Data
ADF test statistic     -2.8422
p-value                 0.0525
# lags used            12.0000
# observations         60.0000
critical value (1%)    -3.5444
critical value (5%)    -2.9111
critical value (10%)   -2.5932
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary
In [32]:
adf_test(market['totalSales'].diff().dropna(),title='Augmented Dickey-Fuller Test on totalSales Data')
Augmented Dickey-Fuller Test: Augmented Dickey-Fuller Test on totalSales Data
ADF test statistic     -1.3499
p-value                 0.6061
# lags used            12.0000
# observations         59.0000
critical value (1%)    -3.5464
critical value (5%)    -2.9119
critical value (10%)   -2.5937
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary

One round of differencing makes the stationarity better.

In [33]:
adf_test(market['totalSales'].diff().diff().dropna(),title='Augmented Dickey-Fuller Test on totalSales Data')
Augmented Dickey-Fuller Test: Augmented Dickey-Fuller Test on totalSales Data
ADF test statistic     -2.1030
p-value                 0.2433
# lags used            12.0000
# observations         58.0000
critical value (1%)    -3.5485
critical value (5%)    -2.9128
critical value (10%)   -2.5941
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary

Differencing twice does not necessarily make it better. Therefore, we will use first-order differencing in the subsequent modeling.
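
As a cross-check, the KPSS test (kpss was imported at the top) could be run on the same series. Note that KPSS reverses the hypotheses: the null is stationarity, so a small p-value suggests non-stationarity. A minimal sketch, with kpss_test as an illustrative helper that is not part of the original workflow:

# Minimal KPSS sketch: the null hypothesis is (level) stationarity, so a small
# p-value here points to non-stationarity (the opposite reading to the ADF test).
def kpss_test(series, title=''):
    stat, p_value, lags, crit = kpss(series.dropna(), regression='c', nlags='auto')
    print(f'KPSS Test: {title}')
    print(f'statistic = {stat:.4f}, p-value = {p_value:.4f}, lags used = {lags}')

kpss_test(market['totalSales'], title='totalSales')
kpss_test(market['totalSales'].diff(), title='totalSales, first difference')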

Change %

We use the pandas pct_change() method on each column to create a column representing the month-over-month change (up) in each market value. The values are fractional changes, e.g. 0.0054 means an increase of 0.54%.

In [34]:
pC = pd.DataFrame()
pC.index.names = ['MS']

for c in market.columns:
    pC[c+'_up%'] = market[c].pct_change()
pC.head()
Out[34]:
10yrBond_up% hmi_up% sse_up% totalSales_up%
MS
2005-01-01 NaN NaN NaN NaN
2005-02-01 0.0054 NaN 0.0958 NaN
2005-03-01 -0.0559 NaN -0.0955 NaN
2005-04-01 -0.0122 NaN -0.0187 NaN
2005-05-01 -0.0514 0.0000 -0.0849 NaN

Time Series - Feature Correlation

Here, we check the correlation between the feature values and the target value (totalSales).

First, we merge all of the data into one dataset.

In [35]:
sns.pairplot(pC.loc['2012-01-01':'2017-12-31']);

Create a heat map

In [36]:
# Make a function to create a correlations map
def correlations(df,targetLabel,ncol=None,fs=(12,6)):
    
    
    if ncol is None:
        ncol = 1
    nrow = 1
    
    f, ax = plt.subplots(nrow,ncol,figsize=fs)
    
    c = list(df.columns)
    c.remove(targetLabel)
    c.append(targetLabel)

    # Compute the correlation matrix
    corr = df[c].corr()

    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))
    
    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap='coolwarm', vmax=1, vmin=-1, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5},
                annot=True,
                fmt=".1f",
                ax=ax
               )
    plt.tight_layout();
In [37]:
correlations(pC,'totalSales_up%',fs=(5,5));

With regard to totalSales_up%, only weak pairwise correlations of the percent changes are found with hmi_up% and sse_up%.

Correlation map By Year

In Regression Step 03 EDA, we have already explored the pairwise correlations of the columns in the original housing price data (df).

Here, we explore it in the same way, but with the following three additional month-over-month market changes:

  • '10yrBond_up%'
  • 'hmi_up%'
  • 'sse_up%'
In [38]:
df['MS']=pd.to_datetime(df.tradeTime,format='%Y-%m-%d').dt.strftime('%Y-%m-01').astype(str)
pC = pC.reset_index()
pC.MS = pC.MS.astype(str)
df = df.merge(pC.drop(columns=['totalSales_up%']),how='left',on='MS')
df.drop(columns=['MS'],inplace=True)
In [39]:
mask_df = (df.tradeTime>='2015-01-01') & (df.tradeTime<='2017-12-31')
In [40]:
# Make a list, c, to put price in the last
def mapCorrelationsByYear(df, ncol=None, figsize=(15, 8)):
    """
    Input dataframe as df.
    Return triangular shape corr map By Year
    
    """    
    
    f, axes = plt.subplots(1,len(df.year.unique()),figsize=(11*len(df.year.unique()), 9*len(df.year.unique())))
    
    for i, year in enumerate(df.year.unique()):
        
        if ncol is None:
            ncol = 1
        nrow = 1

        temp=df[df.year==year].drop(columns='year')

        c = list(temp.columns)
        c.remove('price')
        c.append('price')

        # Compute the correlation matrix
        corr = temp[c].corr()

        # Generate a mask for the upper triangle
        mask = np.triu(np.ones_like(corr, dtype=bool))

        # Draw the heatmap with the mask and correct aspect ratio
        sns.heatmap(corr, mask=mask, cmap='coolwarm', vmax=1, vmin=-1, center=0,
                    square=True, linewidths=.5, cbar_kws={"shrink": .1},
                    annot=True,
                    fmt=".1f",
                    ax=axes[i]
                   )
        axes[i].set(title='year= '+str(year))
    
    plt.tight_layout();
In [41]:
mapCorrelationsByYear(df, ncol=None, figsize=(15, 8))

These maps are somewhat hard to compare when judging how each feature affects price. Next, we focus only on the correlations with price.

Correlation bar By Year

In [42]:
def makeTrans(df):
    """
    Input dataframe 
    Return a new dataframe used to make 'Correlation bar By Year'.
    """
    for i,year in enumerate(df.year.unique()):
        temp=df[df.year==year].drop(columns='year').corr()
        if i==0:
            corrs = pd.DataFrame(columns=list(temp.columns))
            corrs.loc[year]=temp['price']
        else:
            corrs.loc[year]=temp['price']
    corrs.drop('price',axis=1,inplace=True)
    trans = corrs.T
    trans = trans.reset_index().rename(columns={'index':'features'}).melt('features', var_name='year')
    return trans
In [43]:
trans = makeTrans(df)
In [44]:
# https://github.com/jbmouret/matplotlib_for_papers/blob/master/src/plot_variance_matplotlib_white.py
def clean_layout(ax):
    
    """
    This function only makes the layout as clean as possible
    """
#     # Delete unnecessary lines
#     ax.spines['top'].set_visible(False)
#     ax.spines['right'].set_visible(False)
#     ax.spines['left'].set_visible(False)

    try:
        # Remove frame color
        frame = ax.legend_.get_frame()
        frame.set_facecolor('1.0')
        frame.set_edgecolor('1.0')
    except:
        pass

    # y axis grid line
    ax.grid(axis='y', color="0.9", linestyle='-', linewidth=1)
    ax.set_axisbelow(True)

    
In [45]:
def barCorrelationsByYear(trans):
    
    """
    Input dataframe as df
    Return bar corr map By Year
    
    """
    
    fig, ax = plt.subplots(1,1,figsize=(6,36))
    ax = sns.barplot(data=trans, x="value", hue="year", y="features", palette="Blues");
    
    rects = ax.patches
    
    # Make some labels
    labels = [f"label{i}" for i in range(len(rects))]
    
    # For each bar: Place a label
    for rect in rects:
        
        # Get X and Y placement of label from rect.
        x_value = rect.get_width()
        y_value = rect.get_y() + rect.get_height() / 2
        
        
        # Number of points between bar and label. Change to your liking.
        space = 5
        # Vertical alignment for positive values
        ha = 'left'

        # If value of bar is negative: Place label left of bar
        if x_value < 0:
            # Invert space to place label to the left
            space *= -1
            # Horizontally align label at right
            ha = 'right'

        # Use X value as label and format number with one decimal place
        label = "{:.2f}".format(x_value)

        # Create annotation
        plt.annotate(
            label,                      # Use `label` as label
            (x_value, y_value),         # Place label at end of the bar
            xytext=(space, 0),          # Horizontally shift label by `space`
            textcoords="offset points", # Interpret `xytext` as offset in points
            va='center',                # Vertically center label
            ha=ha)                      # Horizontally align label differently for
                                        # positive and negative values.

#     clean_layout(ax)
    ax.grid(axis='y')
    
    
In [46]:
barCorrelationsByYear(trans)

These simple correlation coefficients reveal the following relationships.

  • Subway has the strongest positive relationship with price among the features.
  • constructionTime shows the strongest negative relationship, meaning that newer houses tend to have a lower price (totalPrice/square).

** totalPrice is excluded here as it equals square times price.

Useful data for forecasting

In order to find additional data useful for improving the forecast, we perform the Granger causality test.

The candidates are sse, 10yrBond, and hmi.

The function takes a 2-D array [y (first column), x (second column)] and a maximum number of lags to test on x.

For the two series y and x, the null hypothesis is that lagged values of x do not explain variations in y.
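
Since the printed output per lag is long, a small illustrative helper (granger_pvalues is not part of the original workflow) can collect the ssr-based F-test p-value for each lag into one Series:

# Illustrative helper: run the test quietly and collect the ssr-based F-test
# p-value for every lag, so the long printouts below are easier to scan.
def granger_pvalues(two_cols, maxlag=12):
    res = grangercausalitytests(two_cols.dropna(), maxlag=maxlag, verbose=False)
    return pd.Series({lag: r[0]['ssr_ftest'][1] for lag, r in res.items()},
                     name='p-value (ssr F-test)')

# Example usage:
# granger_pvalues(market[['totalSales', '10yrBond']].loc['2012-01-01':'2017-12-31'])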

totalSales vs 10yrBond

In [47]:
# Here y is set as 'totalSales', and x is set as '10yrBond' which is tested to be lagged.
grangercausalitytests(market[['totalSales','10yrBond']].loc['2012-01-01':'2017-12-31'].dropna(),maxlag=12);
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=6.9690  , p=0.0103  , df_denom=68, df_num=1
ssr based chi2 test:   chi2=7.2764  , p=0.0070  , df=1
likelihood ratio test: chi2=6.9272  , p=0.0085  , df=1
parameter F test:         F=6.9690  , p=0.0103  , df_denom=68, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=3.8771  , p=0.0257  , df_denom=65, df_num=2
ssr based chi2 test:   chi2=8.3506  , p=0.0154  , df=2
likelihood ratio test: chi2=7.8889  , p=0.0194  , df=2
parameter F test:         F=3.8771  , p=0.0257  , df_denom=65, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=2.0270  , p=0.1193  , df_denom=62, df_num=3
ssr based chi2 test:   chi2=6.7677  , p=0.0797  , df=3
likelihood ratio test: chi2=6.4560  , p=0.0914  , df=3
parameter F test:         F=2.0270  , p=0.1193  , df_denom=62, df_num=3

Granger Causality
number of lags (no zero) 4
ssr based F test:         F=1.5429  , p=0.2016  , df_denom=59, df_num=4
ssr based chi2 test:   chi2=7.1128  , p=0.1300  , df=4
likelihood ratio test: chi2=6.7649  , p=0.1488  , df=4
parameter F test:         F=1.5429  , p=0.2016  , df_denom=59, df_num=4

Granger Causality
number of lags (no zero) 5
ssr based F test:         F=1.0891  , p=0.3766  , df_denom=56, df_num=5
ssr based chi2 test:   chi2=6.5149  , p=0.2593  , df=5
likelihood ratio test: chi2=6.2173  , p=0.2856  , df=5
parameter F test:         F=1.0891  , p=0.3766  , df_denom=56, df_num=5

Granger Causality
number of lags (no zero) 6
ssr based F test:         F=1.7101  , p=0.1368  , df_denom=53, df_num=6
ssr based chi2 test:   chi2=12.7773 , p=0.0467  , df=6
likelihood ratio test: chi2=11.6800 , p=0.0695  , df=6
parameter F test:         F=1.7101  , p=0.1368  , df_denom=53, df_num=6

Granger Causality
number of lags (no zero) 7
ssr based F test:         F=1.5555  , p=0.1707  , df_denom=50, df_num=7
ssr based chi2 test:   chi2=14.1546 , p=0.0485  , df=7
likelihood ratio test: chi2=12.8060 , p=0.0770  , df=7
parameter F test:         F=1.5555  , p=0.1707  , df_denom=50, df_num=7

Granger Causality
number of lags (no zero) 8
ssr based F test:         F=1.2999  , p=0.2668  , df_denom=47, df_num=8
ssr based chi2 test:   chi2=14.1610 , p=0.0777  , df=8
likelihood ratio test: chi2=12.7928 , p=0.1192  , df=8
parameter F test:         F=1.2999  , p=0.2668  , df_denom=47, df_num=8

Granger Causality
number of lags (no zero) 9
ssr based F test:         F=1.3579  , p=0.2363  , df_denom=44, df_num=9
ssr based chi2 test:   chi2=17.4986 , p=0.0415  , df=9
likelihood ratio test: chi2=15.4416 , p=0.0795  , df=9
parameter F test:         F=1.3579  , p=0.2363  , df_denom=44, df_num=9

Granger Causality
number of lags (no zero) 10
ssr based F test:         F=1.0552  , p=0.4173  , df_denom=41, df_num=10
ssr based chi2 test:   chi2=15.9562 , p=0.1009  , df=10
likelihood ratio test: chi2=14.1988 , p=0.1641  , df=10
parameter F test:         F=1.0552  , p=0.4173  , df_denom=41, df_num=10

Granger Causality
number of lags (no zero) 11
ssr based F test:         F=0.9314  , p=0.5217  , df_denom=38, df_num=11
ssr based chi2 test:   chi2=16.4468 , p=0.1253  , df=11
likelihood ratio test: chi2=14.5618 , p=0.2035  , df=11
parameter F test:         F=0.9314  , p=0.5217  , df_denom=38, df_num=11

Granger Causality
number of lags (no zero) 12
ssr based F test:         F=0.8976  , p=0.5580  , df_denom=35, df_num=12
ssr based chi2 test:   chi2=18.4649 , p=0.1023  , df=12
likelihood ratio test: chi2=16.0984 , p=0.1868  , df=12
parameter F test:         F=0.8976  , p=0.5580  , df_denom=35, df_num=12

The Granger causality test rejects the null hypothesis at lags of 1 and 2 months (p < 0.05). So we cannot rule out that 10yrBond explains variations in totalSales up to 2 months ahead.

totalSales vs hmi

In [48]:
grangercausalitytests(market[['totalSales','hmi']].loc['2012-01-01':'2017-12-31'].dropna(),maxlag=12);
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.1266  , p=0.7230  , df_denom=68, df_num=1
ssr based chi2 test:   chi2=0.1322  , p=0.7161  , df=1
likelihood ratio test: chi2=0.1321  , p=0.7163  , df=1
parameter F test:         F=0.1266  , p=0.7230  , df_denom=68, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=0.9328  , p=0.3986  , df_denom=65, df_num=2
ssr based chi2 test:   chi2=2.0092  , p=0.3662  , df=2
likelihood ratio test: chi2=1.9809  , p=0.3714  , df=2
parameter F test:         F=0.9328  , p=0.3986  , df_denom=65, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=0.9686  , p=0.4133  , df_denom=62, df_num=3
ssr based chi2 test:   chi2=3.2340  , p=0.3569  , df=3
likelihood ratio test: chi2=3.1605  , p=0.3675  , df=3
parameter F test:         F=0.9686  , p=0.4133  , df_denom=62, df_num=3

Granger Causality
number of lags (no zero) 4
ssr based F test:         F=1.1851  , p=0.3268  , df_denom=59, df_num=4
ssr based chi2 test:   chi2=5.4633  , p=0.2430  , df=4
likelihood ratio test: chi2=5.2549  , p=0.2621  , df=4
parameter F test:         F=1.1851  , p=0.3268  , df_denom=59, df_num=4

Granger Causality
number of lags (no zero) 5
ssr based F test:         F=0.9899  , p=0.4323  , df_denom=56, df_num=5
ssr based chi2 test:   chi2=5.9215  , p=0.3139  , df=5
likelihood ratio test: chi2=5.6743  , p=0.3392  , df=5
parameter F test:         F=0.9899  , p=0.4323  , df_denom=56, df_num=5

Granger Causality
number of lags (no zero) 6
ssr based F test:         F=1.8920  , p=0.0994  , df_denom=53, df_num=6
ssr based chi2 test:   chi2=14.1365 , p=0.0281  , df=6
likelihood ratio test: chi2=12.8091 , p=0.0462  , df=6
parameter F test:         F=1.8920  , p=0.0994  , df_denom=53, df_num=6

Granger Causality
number of lags (no zero) 7
ssr based F test:         F=1.6950  , p=0.1317  , df_denom=50, df_num=7
ssr based chi2 test:   chi2=15.4241 , p=0.0309  , df=7
likelihood ratio test: chi2=13.8402 , p=0.0541  , df=7
parameter F test:         F=1.6950  , p=0.1317  , df_denom=50, df_num=7

Granger Causality
number of lags (no zero) 8
ssr based F test:         F=1.4361  , p=0.2068  , df_denom=47, df_num=8
ssr based chi2 test:   chi2=15.6446 , p=0.0478  , df=8
likelihood ratio test: chi2=13.9962 , p=0.0819  , df=8
parameter F test:         F=1.4361  , p=0.2068  , df_denom=47, df_num=8

Granger Causality
number of lags (no zero) 9
ssr based F test:         F=1.3737  , p=0.2291  , df_denom=44, df_num=9
ssr based chi2 test:   chi2=17.7020 , p=0.0388  , df=9
likelihood ratio test: chi2=15.6006 , p=0.0757  , df=9
parameter F test:         F=1.3737  , p=0.2291  , df_denom=44, df_num=9

Granger Causality
number of lags (no zero) 10
ssr based F test:         F=0.8583  , p=0.5777  , df_denom=41, df_num=10
ssr based chi2 test:   chi2=12.9798 , p=0.2248  , df=10
likelihood ratio test: chi2=11.7852 , p=0.2997  , df=10
parameter F test:         F=0.8583  , p=0.5777  , df_denom=41, df_num=10

Granger Causality
number of lags (no zero) 11
ssr based F test:         F=0.7269  , p=0.7060  , df_denom=38, df_num=11
ssr based chi2 test:   chi2=12.8364 , p=0.3042  , df=11
likelihood ratio test: chi2=11.6496 , p=0.3905  , df=11
parameter F test:         F=0.7269  , p=0.7060  , df_denom=38, df_num=11

Granger Causality
number of lags (no zero) 12
ssr based F test:         F=0.6292  , p=0.8029  , df_denom=35, df_num=12
ssr based chi2 test:   chi2=12.9432 , p=0.3732  , df=12
likelihood ratio test: chi2=11.7202 , p=0.4684  , df=12
parameter F test:         F=0.6292  , p=0.8029  , df_denom=35, df_num=12

As for hmi (housing market index), there seems to be no Granger-causal relationship between 'totalSales' and 'hmi'.

totalSales vs sse

In [49]:
grangercausalitytests(market[['totalSales','sse']].loc['2012-01-01':'2017-12-31'].dropna(),maxlag=12);
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=4.2289  , p=0.0436  , df_denom=68, df_num=1
ssr based chi2 test:   chi2=4.4155  , p=0.0356  , df=1
likelihood ratio test: chi2=4.2836  , p=0.0385  , df=1
parameter F test:         F=4.2289  , p=0.0436  , df_denom=68, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=2.0956  , p=0.1312  , df_denom=65, df_num=2
ssr based chi2 test:   chi2=4.5135  , p=0.1047  , df=2
likelihood ratio test: chi2=4.3740  , p=0.1123  , df=2
parameter F test:         F=2.0956  , p=0.1312  , df_denom=65, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=2.9428  , p=0.0399  , df_denom=62, df_num=3
ssr based chi2 test:   chi2=9.8250  , p=0.0201  , df=3
likelihood ratio test: chi2=9.1855  , p=0.0269  , df=3
parameter F test:         F=2.9428  , p=0.0399  , df_denom=62, df_num=3

Granger Causality
number of lags (no zero) 4
ssr based F test:         F=2.6816  , p=0.0401  , df_denom=59, df_num=4
ssr based chi2 test:   chi2=12.3626 , p=0.0148  , df=4
likelihood ratio test: chi2=11.3588 , p=0.0228  , df=4
parameter F test:         F=2.6816  , p=0.0401  , df_denom=59, df_num=4

Granger Causality
number of lags (no zero) 5
ssr based F test:         F=1.6968  , p=0.1504  , df_denom=56, df_num=5
ssr based chi2 test:   chi2=10.1508 , p=0.0711  , df=5
likelihood ratio test: chi2=9.4516  , p=0.0924  , df=5
parameter F test:         F=1.6968  , p=0.1504  , df_denom=56, df_num=5

Granger Causality
number of lags (no zero) 6
ssr based F test:         F=1.6482  , p=0.1523  , df_denom=53, df_num=6
ssr based chi2 test:   chi2=12.3147 , p=0.0553  , df=6
likelihood ratio test: chi2=11.2913 , p=0.0798  , df=6
parameter F test:         F=1.6482  , p=0.1523  , df_denom=53, df_num=6

Granger Causality
number of lags (no zero) 7
ssr based F test:         F=2.8126  , p=0.0150  , df_denom=50, df_num=7
ssr based chi2 test:   chi2=25.5944 , p=0.0006  , df=7
likelihood ratio test: chi2=21.5803 , p=0.0030  , df=7
parameter F test:         F=2.8126  , p=0.0150  , df_denom=50, df_num=7

Granger Causality
number of lags (no zero) 8
ssr based F test:         F=3.1740  , p=0.0059  , df_denom=47, df_num=8
ssr based chi2 test:   chi2=34.5760 , p=0.0000  , df=8
likelihood ratio test: chi2=27.6445 , p=0.0005  , df=8
parameter F test:         F=3.1740  , p=0.0059  , df_denom=47, df_num=8

Granger Causality
number of lags (no zero) 9
ssr based F test:         F=2.9657  , p=0.0076  , df_denom=44, df_num=9
ssr based chi2 test:   chi2=38.2168 , p=0.0000  , df=9
likelihood ratio test: chi2=29.8702 , p=0.0005  , df=9
parameter F test:         F=2.9657  , p=0.0076  , df_denom=44, df_num=9

Granger Causality
number of lags (no zero) 10
ssr based F test:         F=2.8202  , p=0.0093  , df_denom=41, df_num=10
ssr based chi2 test:   chi2=42.6470 , p=0.0000  , df=10
likelihood ratio test: chi2=32.4544 , p=0.0003  , df=10
parameter F test:         F=2.8202  , p=0.0093  , df_denom=41, df_num=10

Granger Causality
number of lags (no zero) 11
ssr based F test:         F=2.5894  , p=0.0146  , df_denom=38, df_num=11
ssr based chi2 test:   chi2=45.7230 , p=0.0000  , df=11
likelihood ratio test: chi2=34.1211 , p=0.0003  , df=11
parameter F test:         F=2.5894  , p=0.0146  , df_denom=38, df_num=11

Granger Causality
number of lags (no zero) 12
ssr based F test:         F=2.3093  , p=0.0269  , df_denom=35, df_num=12
ssr based chi2 test:   chi2=47.5052 , p=0.0000  , df=12
likelihood ratio test: chi2=34.9917 , p=0.0005  , df=12
parameter F test:         F=2.3093  , p=0.0269  , df_denom=35, df_num=12

As for sse (the stock index), the null hypothesis is rejected most clearly at a lag of 8 months. Therefore, we cannot rule out that sse explains variations in totalSales.

Among the three, we choose the 10-year bond as the feature most related to sales.

Step 04 Preprocessing and training

Split the data into train/test sets

In [50]:
data = market[['totalSales','10yrBond']].loc['2015-01-01':'2017-12-31']

y = data['totalSales']
exog1 = data['10yrBond']
# len(y),len(exog1)


# This is for Facebook's Prophet.
df_prophet = data[['totalSales']]
df_prophet.reset_index(inplace=True)
df_prophet.columns = ['ds','y']

# Define the split line between train and test data
split=int(len(data) * 0.8) 

# Make train and test variables, with 'train, test'
data_train, data_test = data.iloc[:split], data.iloc[split:]
y_train, y_test = y.iloc[:split].to_frame(), y.iloc[split:].to_frame()

Forecast using only time series data

In [51]:
# For SARIMA Orders we set seasonal=True and pass in an m value

auto_arima(y,
           seasonal=True,
           trace=True,
           error_action='ignore',
           suppress_warnings=True,
          ).summary()
Performing stepwise search to minimize aic
 ARIMA(2,0,2)(0,0,0)[0] intercept   : AIC=1106.350, Time=0.02 sec
 ARIMA(0,0,0)(0,0,0)[0] intercept   : AIC=1110.370, Time=0.01 sec
 ARIMA(1,0,0)(0,0,0)[0] intercept   : AIC=1101.423, Time=0.01 sec
 ARIMA(0,0,1)(0,0,0)[0] intercept   : AIC=1101.358, Time=0.01 sec
 ARIMA(0,0,0)(0,0,0)[0]             : AIC=1165.177, Time=0.00 sec
 ARIMA(1,0,1)(0,0,0)[0] intercept   : AIC=1102.659, Time=0.01 sec
 ARIMA(0,0,2)(0,0,0)[0] intercept   : AIC=1103.701, Time=0.01 sec
 ARIMA(1,0,2)(0,0,0)[0] intercept   : AIC=1104.405, Time=0.02 sec
 ARIMA(0,0,1)(0,0,0)[0]             : AIC=1146.814, Time=0.01 sec

Best model:  ARIMA(0,0,1)(0,0,0)[0] intercept
Total fit time: 0.098 seconds
Out[51]:
SARIMAX Results
Dep. Variable: y No. Observations: 36
Model: SARIMAX(0, 0, 1) Log Likelihood -547.679
Date: Fri, 17 Dec 2021 AIC 1101.358
Time: 09:35:07 BIC 1106.108
Sample: 0 HQIC 1103.016
- 36
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 2.256e+06 3.32e+05 6.800 0.000 1.61e+06 2.91e+06
ma.L1 0.5502 0.179 3.074 0.002 0.199 0.901
sigma2 1.116e+12 0.102 1.09e+13 0.000 1.12e+12 1.12e+12
Ljung-Box (L1) (Q): 0.23 Jarque-Bera (JB): 1.69
Prob(Q): 0.63 Prob(JB): 0.43
Heteroskedasticity (H): 3.34 Skew: 0.53
Prob(H) (two-sided): 0.05 Kurtosis: 2.84


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 6.35e+28. Standard errors may be unstable.

SARIMA model

To make a time series forecast, we first define a helper function that fits a SARIMAX model with given orders and evaluates it with a walk-forward, one-step-ahead forecast over the test period.

In [52]:
# Load specific forecasting tools
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
from itertools import product

# Make a function called evaluate_sarimax_model
def evaluate_sarimax_model(data, order,seasonal_order, exog=None):
    """
    Input dataframe as data
    and order and seasonal_order to be used in the SARIMAX
    
    Return error_mse,error_rmse, y_test, f(forecast on y_test)
    """
    
    
    # Needs to be an integer because it is later used as an index.
    split=int(len(data) * 0.8)
    
    # Make train and test variables, with 'train, test'
    y_train, y_test = data.iloc[:split], data.iloc[split:]
    past=[x for x in y_train]
    
    
    try:
        exog_train, exog_test = exog.iloc[:split], exog.iloc[split:]
        e=exog_test[0]
    except:
        exog_train, exog_test,e=None,None,None
    
    try:
        exog_forecast = exog_test.to_frame()
        past_exog = [x for x in exog_train]
    except:
        exog_forecast=None
        past_exog=None
        
    
    
    
    # make predictions
    f = list()
    
    for i in range(len(y_test)):#timestep-wise comparison between test data and one-step prediction ARIMA model. 
        model = SARIMAX(past,order=order,seasonal_order=seasonal_order,exog=past_exog,
#                     enforce_invertibility=False,
                   )
        
        model_fit = model.fit(disp=0)
        future = model_fit.forecast(exog=e)[0]
        f.append(future)
        past.append(y_test[i])
        if past_exog!=None:
            past_exog.append(exog_test[i])
            e=exog_test[i]
    
    
    # calculate out of sample error
    error_mse = mean_squared_error(y_test, f)
    error_rmse = rmse(y_test, f)
    
    # Return the error
    return [error_mse,error_rmse, y_test, f]
In [53]:
error_mse, error_rmse, test, f = evaluate_sarimax_model(y,(0,0,1),(0,0,1,12), exog=None)
print('error_mse %.5e, error_rmse %.0f'% (error_mse, error_rmse))
error_mse 2.90493e+11, error_rmse 538974
In [54]:
ModelComparison = pd.DataFrame([{'Model name':'SARIMA with auto_arima',
                                'MAE':error_mse,  # note: this value is the MSE computed above
                                'Best para':'(0,0,1),(0,0,1,12)',
                                }
                                ])
ModelComparison
Out[54]:
Model name MAE Best para
0 SARIMA with auto_arima 290493316389.1049 (0,0,1),(0,0,1,12)

Compare predictions to expected values

Visualize the results.

In [55]:
# Plot predictions against known values
title ='Monthly Beijing Housing Total sales'
ylabel='Total sales'
xlabel='Month'

# Refit the model
model = SARIMAX(y_train, order=(0,0,1),seasonal_order=(0,0,1,12))
model_fit = model.fit(disp=0)
predictions = model_fit.forecast(len(y_test))


# ax = market['totalSales'].loc['2015-01-01':'2017-12-31'].plot(legend=True,figsize=(12,6),title=title)
ax = y.plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

This is the result based on the parameters suggested by auto_arima, and we use it as the baseline model.

Step 05 Modeling

SARIMA with Grid Search for different p, d, and q values.

We tried to use scikit-learn's GridSearchCV, but it kept raising errors, so here we run the grid search manually with nested for loops.
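
As an aside, itertools.product (already imported above) could flatten the seven nested loops in the next cell into a single loop; a short equivalent sketch using the same parameter lists defined two cells below:

# Sketch: iterate over all (p,d,q)(P,D,Q,m) combinations in one loop;
# the behaviour is the same as the nested loops in evaluate_models.
for p, d, q, P, D, Q, m in product(p_values, d_values, q_values,
                                   P_values, D_values, Q_values, m_values):
    order, seasonal_order = (p, d, q), (P, D, Q, m)
    # ... call evaluate_sarimax_model(y, order, seasonal_order, exog=exog) as in evaluate_models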

In [56]:
# Make a function to evaluate different ARIMA models with several different p, d, and q values.

def evaluate_models(y, p_values, d_values, q_values, P_values,D_values,Q_values, m_values,exog=None):
    best_score, best_odrder, best_sesonal_order = float("inf"), None, None
    grid_search = pd.DataFrame(columns=['p','d','q','P','D','Q','m','mse','rmse'])
    # Iterate through p_values
    for p in p_values:
        # Iterate through d_values
        for d in d_values:
            # Iterate through q_values
            for q in q_values:
                
                for P in P_values:
                    
                    for D in D_values:
                        
                        for Q in Q_values:
                            
                            for m in m_values:
                                
                                # p, d, q iterator variables in that order
                                order = (p,d,q)

                                seasonal_order = (P,D,Q,m)
#                                 print('==Now working on order%s,seasonal_order%s==' % (order,seasonal_order))
                                mse,rmse=None,None
                                ps=[p,d,q,P,D,Q,m,mse,rmse]
                                params = {'p':p, 'd':d, 'q':q, 'P':P, 'D':D, 'Q':Q, 'm':m, 'mse':mse,'rmse':rmse}
                                try:
                                    # Make a variable called mse for the Mean squared error
                                    mse,rmse = evaluate_sarimax_model(y, order, seasonal_order,exog=exog)[:2]
                                    grid_search = grid_search.append(pd.Series(ps, index = grid_search.columns), ignore_index=True)
                                    if mse < best_score:
                                        best_score, best_odrder, best_sesonal_order = mse, order, seasonal_order
#                                     print('SARIMAX %s,%s \t\tMSE=%.3f' % (order,seasonal_order,mse))
                                except:
                                    continue
#                                     print('MSE=%.3f RMSE=%.3f' % (mse,rmse))

    
            
    print(grid_search.dropna().sort_values('mse').head(1))                   
    print('Best SARIMAX %s %s MSE=%.3f' % (best_odrder, best_sesonal_order, best_score))
    return [best_odrder, best_sesonal_order, best_score]
In [57]:
p_values = [x for x in range(0, 3)]
d_values = [x for x in range(0, 3)]
q_values = [x for x in range(0, 3)]
P_values = [x for x in range(0, 3)]
D_values = [x for x in range(0, 3)]
Q_values = [x for x in range(0, 3)]
m_values = [3,4,6,12]
In [58]:
import json

try:
    
    with open("../data/sarima.json", 'r') as f:
        best_odrder1, best_sesonal_order1, best_score1 = json.load(f)
except:
    
    import warnings
    warnings.filterwarnings("ignore")
    
    best_odrder1, best_sesonal_order1, best_score1 = evaluate_models(y, p_values, d_values, q_values, P_values,D_values,Q_values, m_values)
    
    with open("../data/sarima.json", 'w') as f:
        # indent=2 is not needed but makes the file human-readable
        json.dump( [best_odrder1, best_sesonal_order1, best_score1], f) 
In [59]:
print('According to this grid search, \
the best model parameter is SARIMAX %s %s and MSE is %.5e'\
       % (best_odrder1, best_sesonal_order1, best_score1))
       
       
According to this grid search, the best model parameter is SARIMAX [1, 0, 0] [1, 0, 1, 4] and MSE is 5.79619e+10
In [60]:
# Add the metrics to the ModelComparison
def addModel(df1,name,best_score,Best_para):
    # Set a variable for Best para
    
    df2=pd.DataFrame([{'Model name':str(name),
                        'mse':best_score,
#                         'RMSE':np.sqrt(mean_squared_error(y_test,predictions)),
#                         'Explained_variance_score':explained_variance_score(y_test,predictions),
                        'Best para': Best_para,
                      } 
                    ])
    return pd.concat([df1,df2],axis=0)
In [61]:
ModelComparison = addModel(ModelComparison,'SARIMA with Grid Search',best_score1,str(best_odrder1)+','+str(best_sesonal_order1))
ModelComparison
Out[61]:
Model name MAE Best para mse
0 SARIMA with auto_arima 290493316389.1049 (0,0,1),(0,0,1,12) NaN
0 SARIMA with Grid Search NaN [1, 0, 0],[1, 0, 1, 4] 57961923350.2968
In [62]:
# Plot predictions against known values
title =''
ylabel='Total sales'
xlabel='Month'

# Refit the model
model = SARIMAX(y_train, order=best_odrder1,seasonal_order=best_sesonal_order1)
model_fit = model.fit(disp=0)
predictions = model_fit.forecast(len(y_test))



# ax = market['totalSales'].loc['2015-01-01':'2017-12-31'].plot(legend=True,figsize=(12,6),title=title)
ax = y.plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

This result is much better than the previous one.

SARIMAX, SARIMA with an exogenous variable

Fit a model

Among the three market series, 10yrBond had the lowest p-values at short lags in the Granger test. Next, we use 10yrBond as the exogenous variable.

In [63]:
error_mse, error_rmse, test, f = evaluate_sarimax_model(y,exog=exog1,order=best_odrder1,seasonal_order=best_sesonal_order1)
print('error_mse %.5e, error_rmse %.0f'% (error_mse, error_rmse))
error_mse 4.61054e+11, error_rmse 679010
In [64]:
predictions= data_test[[]].copy()
predictions['predictions']=f
In [65]:
# Plot predictions against known values
title =''
ylabel='Total sales'
xlabel='Month'



ax = y.plot(legend=True,figsize=(12,6),title=title)
predictions['predictions'].plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);
In [66]:
import json

try:
    
    with open("../data/sarimax.json", 'r') as f:
        best_odrder2, best_sesonal_order2, best_score2 = json.load(f)
except:
    
    import warnings
    warnings.filterwarnings("ignore")
    
    best_odrder2, best_sesonal_order2, best_score2 = evaluate_models(y, p_values, d_values, q_values, P_values,D_values,Q_values, m_values,exog=exog1)
    
    with open("../data/sarimax.json", 'w') as f:
        # indent=2 is not needed but makes the file human-readable
        json.dump( [best_odrder2, best_sesonal_order2, best_score2], f) 
In [67]:
error_mse, error_rmse, test, f = evaluate_sarimax_model(y,exog=exog1,order=best_odrder2,seasonal_order=best_sesonal_order2,)
print('error_mse %.5e, error_rmse %.0f'% (error_mse, error_rmse))
/Users/akiofukashima/opt/anaconda3/envs/mypython3/lib/python3.7/site-packages/statsmodels/base/model.py:568: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals

/Users/akiofukashima/opt/anaconda3/envs/mypython3/lib/python3.7/site-packages/statsmodels/base/model.py:568: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals

/Users/akiofukashima/opt/anaconda3/envs/mypython3/lib/python3.7/site-packages/statsmodels/base/model.py:568: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals

/Users/akiofukashima/opt/anaconda3/envs/mypython3/lib/python3.7/site-packages/statsmodels/base/model.py:568: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals

error_mse 3.13901e+10, error_rmse 177172
/Users/akiofukashima/opt/anaconda3/envs/mypython3/lib/python3.7/site-packages/statsmodels/base/model.py:568: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals

This is the lowest error so far.

In [68]:
ModelComparison = addModel(ModelComparison,'SARIMAX with Grid Search',best_score2,str(best_odrder2)+' '+str(best_sesonal_order2))
ModelComparison
Out[68]:
Model name MAE Best para mse
0 SARIMA with auto_arima 290493316389.1049 (0,0,1),(0,0,1,12) NaN
0 SARIMA with Grid Search NaN [1, 0, 0],[1, 0, 1, 4] 57961923350.2968
0 SARIMAX with Grid Search NaN [2, 1, 2] [2, 0, 1, 12] 31390066067.7890
In [69]:
# Refit the model
predictions= data_test[[]].copy()
predictions['predictions']=f
In [70]:
# Plot predictions against known values
title =''
ylabel='Total sales'
xlabel='Month'

ax = y.plot(legend=True,figsize=(12,6),title=title)
predictions['predictions'].plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

Facebook's Prophet

In [71]:
import pandas as pd
from fbprophet import Prophet
%matplotlib inline
In [72]:
m = Prophet(
#     weekly_seasonality=True,
#     daily_seasonality=True,
    seasonality_mode='multiplicative',
)
m.fit(df_prophet)
future = m.make_future_dataframe(3, freq='MS')
forecast = m.predict(future)
fig = m.plot(forecast);
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Initial log joint probability = -48.7188
Iteration  1. Log joint probability =     4.1058. Improved by 52.8246.
Iteration  2. Log joint probability =    33.9404. Improved by 29.8346.
Iteration  3. Log joint probability =     42.811. Improved by 8.87058.
Iteration  4. Log joint probability =    45.7157. Improved by 2.90471.
Iteration  5. Log joint probability =    45.8209. Improved by 0.105114.
Iteration  6. Log joint probability =    45.8358. Improved by 0.0149531.
Iteration  7. Log joint probability =    45.8632. Improved by 0.0273917.
Iteration  8. Log joint probability =     45.885. Improved by 0.0218447.
Iteration  9. Log joint probability =    45.8917. Improved by 0.00669696.
Iteration 10. Log joint probability =    45.9027. Improved by 0.010979.
Iteration 11. Log joint probability =    45.9072. Improved by 0.00445423.
Iteration 12. Log joint probability =    45.9083. Improved by 0.00108658.
Iteration 13. Log joint probability =    45.9125. Improved by 0.00424969.
Iteration 14. Log joint probability =    45.9181. Improved by 0.00553977.
Iteration 15. Log joint probability =    45.9198. Improved by 0.00175283.
Iteration 16. Log joint probability =    45.9199. Improved by 5.04676e-05.
Iteration 17. Log joint probability =    45.9221. Improved by 0.00220224.
Iteration 18. Log joint probability =    45.9249. Improved by 0.0028697.
Iteration 19. Log joint probability =    45.9262. Improved by 0.0012622.
Iteration 20. Log joint probability =    45.9274. Improved by 0.00118812.
Iteration 21. Log joint probability =    45.9283. Improved by 0.000930904.
Iteration 22. Log joint probability =    45.9286. Improved by 0.000277968.
Iteration 23. Log joint probability =    45.9291. Improved by 0.000479469.
Iteration 24. Log joint probability =    45.9293. Improved by 0.000247492.
Iteration 25. Log joint probability =    45.9297. Improved by 0.000351087.
Iteration 26. Log joint probability =    45.9297. Improved by 1.7108e-05.
Iteration 27. Log joint probability =    45.9298. Improved by 0.000142046.
Iteration 28. Log joint probability =    45.9299. Improved by 4.9263e-05.
Iteration 29. Log joint probability =      45.93. Improved by 0.000103664.
Iteration 30. Log joint probability =    45.9301. Improved by 9.11048e-05.
Iteration 31. Log joint probability =    45.9302. Improved by 0.000112431.
Iteration 32. Log joint probability =    45.9303. Improved by 8.39049e-05.
Iteration 33. Log joint probability =    45.9303. Improved by 1.46484e-05.
Iteration 34. Log joint probability =    45.9303. Improved by 9.21706e-06.
Iteration 35. Log joint probability =    45.9303. Improved by 3.65791e-05.
Iteration 36. Log joint probability =    45.9303. Improved by 1.6927e-05.
Iteration 37. Log joint probability =    45.9304. Improved by 1.68595e-05.
Iteration 38. Log joint probability =    45.9304. Improved by 2.80863e-05.
Iteration 39. Log joint probability =    45.9304. Improved by 5.24653e-06.
Iteration 40. Log joint probability =    45.9304. Improved by 2.4171e-06.
Iteration 41. Log joint probability =    45.9304. Improved by 1.08241e-05.
Iteration 42. Log joint probability =    45.9304. Improved by 1.19894e-06.
Iteration 43. Log joint probability =    45.9304. Improved by 3.46692e-06.
Iteration 44. Log joint probability =    45.9304. Improved by 1.81356e-06.
Iteration 45. Log joint probability =    45.9304. Improved by 1.95896e-06.
Iteration 46. Log joint probability =    45.9304. Improved by 1.77057e-06.
Iteration 47. Log joint probability =    45.9304. Improved by 1.58361e-06.
Iteration 48. Log joint probability =    45.9304. Improved by 2.83109e-06.
Iteration 49. Log joint probability =    45.9304. Improved by 1.92458e-06.
Iteration 50. Log joint probability =    45.9304. Improved by 1.24229e-06.
Iteration 51. Log joint probability =    45.9304. Improved by 1.36355e-06.
Iteration 52. Log joint probability =    45.9304. Improved by 1.64097e-06.
Iteration 53. Log joint probability =    45.9304. Improved by 3.89797e-06.
Iteration 54. Log joint probability =    45.9304. Improved by 1.26288e-06.
Iteration 55. Log joint probability =    45.9304. Improved by 8.5713e-08.
Iteration 56. Log joint probability =    45.9304. Improved by 1.08077e-06.
Iteration 57. Log joint probability =    45.9304. Improved by 3.59432e-07.
Iteration 58. Log joint probability =    45.9304. Improved by 5.76157e-07.
Iteration 59. Log joint probability =    45.9304. Improved by 1.78667e-07.
Iteration 60. Log joint probability =    45.9304. Improved by 3.26157e-07.
Iteration 61. Log joint probability =    45.9304. Improved by 4.43667e-08.
Iteration 62. Log joint probability =    45.9304. Improved by 8.05856e-08.
Iteration 63. Log joint probability =    45.9304. Improved by 1.38816e-07.
Iteration 64. Log joint probability =    45.9304. Improved by 4.2026e-08.
Iteration 65. Log joint probability =    45.9304. Improved by 2.86879e-08.
Iteration 66. Log joint probability =    45.9304. Improved by 2.14271e-08.
Iteration 67. Log joint probability =    45.9304. Improved by 4.1393e-08.
Iteration 68. Log joint probability =    45.9304. Improved by 3.12805e-08.
Iteration 69. Log joint probability =    45.9304. Improved by 1.22802e-08.
Iteration 70. Log joint probability =    45.9304. Improved by 3.06564e-08.
Iteration 71. Log joint probability =    45.9304. Improved by 1.42549e-08.
Iteration 72. Log joint probability =    45.9304. Improved by 1.63799e-08.
Iteration 73. Log joint probability =    45.9304. Improved by 1.0816e-08.
Iteration 74. Log joint probability =    45.9304. Improved by 4.14009e-09.
In [73]:
fig = m.plot_components(forecast)
In [74]:
from fbprophet.plot import add_changepoints_to_plot
fig = m.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), m, forecast)

Error Evaluation

Here we calculate the error to compare with the other models.

In [75]:
len(data_train)
Out[75]:
28
In [76]:
data_test['totalSales']
Out[76]:
2017-05-01    832191.8000
2017-06-01    905690.5000
2017-07-01   1240611.7000
2017-08-01   1307715.0000
2017-09-01   1590959.3000
2017-10-01   1269731.6000
2017-11-01   1491095.8000
2017-12-01   1676201.4000
Freq: MS, Name: totalSales, dtype: float64
In [77]:
start=len(data_train)
end=len(data_train)+len(data_test)-1

predictions = forecast.iloc[start:end+1]['yhat']
print('mse %.5e'% (mean_squared_error(predictions,data_test['totalSales'])))
mse 1.00977e+12

Although Prophet is much faster to set up and fit than SARIMAX, its error is much worse. This may be because the amount of data used to train the model is too small.
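
One caveat: m was fitted on the full df_prophet, so the yhat values over the test window are partly in-sample. For a comparison on the same footing as the SARIMAX walk-forward evaluation, one could refit on the training window only; a minimal sketch (m_train, future_test and yhat_test are illustrative names):

# Sketch: fit Prophet on the first 28 months only and forecast the 8 test months,
# so that the MSE is computed on out-of-sample predictions.
m_train = Prophet(seasonality_mode='multiplicative')
m_train.fit(df_prophet.iloc[:split])
future_test = m_train.make_future_dataframe(len(data_test), freq='MS')
yhat_test = m_train.predict(future_test)['yhat'].iloc[-len(data_test):]
print('mse %.5e' % mean_squared_error(yhat_test.values, data_test['totalSales'].values))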

In [78]:
ModelComparison = addModel(ModelComparison,"Facebook's Prophet",mean_squared_error(predictions,data_test['totalSales']),None)
ModelComparison
Out[78]:
Model name MAE Best para mse
0 SARIMA with auto_arima 290493316389.1049 (0,0,1),(0,0,1,12) NaN
0 SARIMA with Grid Search NaN [1, 0, 0],[1, 0, 1, 4] 57961923350.2968
0 SARIMAX with Grid Search NaN [2, 1, 2] [2, 0, 1, 12] 31390066067.7890
0 Facebook's Prophet NaN None 1009771003677.5349

RNN (Recurrent Neural Network)

Scale Data

In [79]:
from sklearn.preprocessing import MinMaxScaler
In [80]:
scaler = MinMaxScaler()
In [81]:
scaler.fit(y_train)
Out[81]:
MinMaxScaler()
In [82]:
scaled_train = scaler.transform(y_train)
scaled_test = scaler.transform(y_test)
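
A quick sanity check: scaler.inverse_transform maps scaled values back to the original sales units, which is how any scaled model output will need to be converted before computing errors.

# Round-trip check: scaling followed by inverse_transform recovers the
# original training values (up to floating point error).
restored = scaler.inverse_transform(scaled_train)
print(np.allclose(restored, y_train.values))   # expected: True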

Time Series Generator

In [83]:
from keras.preprocessing.sequence import TimeseriesGenerator
Using TensorFlow backend.
In [84]:
n_input = 12
n_features=1
generator = TimeseriesGenerator(
    scaled_train, 
    scaled_train, 
    length=n_input, 
    batch_size=1
)
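
To see what the generator yields, the first sample can be inspected; with n_input=12 and 28 scaled training points there are 16 samples, which matches the 16 steps per epoch in the training log below.

# Each sample pairs the previous 12 scaled months (X) with the following month (y).
X0, y0 = generator[0]
print(len(generator), X0.shape, y0.shape)   # 16 (1, 12, 1) (1, 1)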

Fit a Model

In [85]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
In [86]:
# define model
model = Sequential()
model.add(LSTM(150, activation='relu', input_shape=(n_input, n_features)))
model.add(Dense(1)) # Only one predicted result is required. 
model.compile(optimizer='adam', loss='mse')
2021-12-17 09:35:16.892811: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-17 09:35:16.893324: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
In [87]:
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 150)               91200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 151       
=================================================================
Total params: 91,351
Trainable params: 91,351
Non-trainable params: 0
_________________________________________________________________
In [88]:
# fit model
model.fit_generator(generator,epochs=100)
Epoch 1/100
16/16 [==============================] - 1s 83ms/step - loss: 0.2782
Epoch 2/100
16/16 [==============================] - 0s 21ms/step - loss: 0.1340
Epoch 3/100
16/16 [==============================] - 0s 22ms/step - loss: 0.1140
Epoch 4/100
16/16 [==============================] - 0s 21ms/step - loss: 0.0871
[Epochs 5-94 omitted: the loss fluctuates while drifting down from about 0.08 to about 0.06]
Epoch 95/100
16/16 [==============================] - 0s 21ms/step - loss: 0.0524
Epoch 96/100
16/16 [==============================] - 0s 22ms/step - loss: 0.0521
Epoch 97/100
16/16 [==============================] - 0s 20ms/step - loss: 0.0521
Epoch 98/100
16/16 [==============================] - 0s 22ms/step - loss: 0.0533
Epoch 99/100
16/16 [==============================] - 0s 20ms/step - loss: 0.0550
Epoch 100/100
16/16 [==============================] - 0s 21ms/step - loss: 0.0676
Out[88]:
<keras.callbacks.callbacks.History at 0x7f8c5f069c90>
In [89]:
model.history.history.keys()
Out[89]:
dict_keys(['loss'])
In [90]:
loss_per_epoch = model.history.history['loss']
plt.plot(range(len(loss_per_epoch)),loss_per_epoch)
Out[90]:
[<matplotlib.lines.Line2D at 0x7f8c5eddd550>]
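The loss curve is fairly noisy, which is expected with only 16 training windows. One optional refinement, not used in this run, is an early-stopping callback so that training halts once the loss stops improving instead of always running the full 100 epochs:

from keras.callbacks import EarlyStopping

# Sketch: stop when the training loss has not improved for 10 epochs
# (monitoring the training loss because no validation split is used here).
early_stop = EarlyStopping(monitor='loss', patience=10, restore_best_weights=True)
# model.fit_generator(generator, epochs=100, callbacks=[early_stop])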

Evaluate on Test Data

In [91]:
first_eval_batch = scaled_train[-split:]
In [92]:
first_eval_batch
Out[92]:
array([[0.09266475],
       [0.        ],
       [0.25862159],
       [0.32622773],
       [0.31198791],
       [0.29679101],
       [0.27482047],
       [0.29391198],
       [0.23094754],
       [0.26722633],
       [0.34973917],
       [0.56579551],
       [0.54973534],
       [0.36150792],
       [0.79772623],
       [0.35138097],
       [0.39011035],
       [0.42700577],
       [0.65765889],
       [1.        ],
       [0.94408871],
       [0.23328972],
       [0.29493374],
       [0.51558847],
       [0.3699155 ],
       [0.72905584],
       [0.92845269],
       [0.1221668 ]])
In [93]:
first_eval_batch = first_eval_batch.reshape((1, split, n_features))
In [94]:
# Generate predictions into the same time stamps as the test set

test_predictions = []

first_eval_batch = scaled_train[-n_input:]
current_batch = first_eval_batch.reshape((1, n_input, n_features))

for i in range(len(y_test)):
    
    # predict one time step ahead ([0] selects the prediction for the single sample in the batch)
    current_pred = model.predict(current_batch)[0]
    
    # store prediction
    test_predictions.append(current_pred) 
    
    # update batch to now include prediction and drop first value
    current_batch = np.append(current_batch[:,1:,:],[[current_pred]],axis=1)
    
In [95]:
### Inverse Transformations and Compare
test_predictions
Out[95]:
[array([0.40705922], dtype=float32),
 array([0.54838717], dtype=float32),
 array([0.8879891], dtype=float32),
 array([1.2820263], dtype=float32),
 array([0.9974545], dtype=float32),
 array([0.39300567], dtype=float32),
 array([0.34618282], dtype=float32),
 array([0.3624613], dtype=float32)]
In [96]:
true_predictions = scaler.inverse_transform(test_predictions)
In [97]:
true_predictions
Out[97]:
array([[2420143.24159727],
       [3059390.33692974],
       [4595459.83965787],
       [6377747.99977894],
       [5090588.03779441],
       [2356576.95255   ],
       [2144790.33274164],
       [2218420.2720959 ]])

Create a new dataframe that holds both the original test values and our predictions for them.

In [98]:
y_test['Predictions'] = true_predictions
test.head()
Out[98]:
2017-05-01    832191.8000
2017-06-01    905690.5000
2017-07-01   1240611.7000
2017-08-01   1307715.0000
2017-09-01   1590959.3000
Freq: MS, Name: totalSales, dtype: float64
In [99]:
y_test.plot(figsize=(12,8));
In [100]:
# Plot predictions against known values
title ='Monthly Beijing Housing Total sales'
ylabel='Total sales'
xlabel='Month'

ax = y.plot(legend=True,figsize=(12,6),title=title)
y_test['Predictions'].plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

Error Evaluation

In [101]:
print('mse %.5e'% (mean_squared_error(y_test['Predictions'],y_test.iloc[:,0])))
mse 7.28378e+12
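Because the squared error is hard to read at this scale, an RMSE in the original sales units could be reported alongside it. A small sketch, using the rmse helper from statsmodels:

from statsmodels.tools.eval_measures import rmse

# Sketch: RMSE is in the same units as totalSales, so it is easier to interpret
# than the raw MSE above.
print('rmse %.5e' % rmse(y_test['Predictions'], y_test['totalSales']))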
In [102]:
y_test
Out[102]:
totalSales Predictions
2017-05-01 832191.8000 2420143.2416
2017-06-01 905690.5000 3059390.3369
2017-07-01 1240611.7000 4595459.8397
2017-08-01 1307715.0000 6377747.9998
2017-09-01 1590959.3000 5090588.0378
2017-10-01 1269731.6000 2356576.9526
2017-11-01 1491095.8000 2144790.3327
2017-12-01 1676201.4000 2218420.2721
In [103]:
ModelComparison = addModel(ModelComparison,'RNN',mean_squared_error(y_test['Predictions'],y_test['totalSales']),None)
ModelComparison
Out[103]:
   Model name                 MAE                Best para                mse
0  SARIMA with auto_arima     290493316389.1049  (0,0,1),(0,0,1,12)       NaN
0  SARIMA with Grid Search    NaN                [1, 0, 0],[1, 0, 1, 4]   57961923350.2968
0  SARIMAX with Grid Search   NaN                [2, 1, 2] [2, 0, 1, 12]  31390066067.7890
0  Facebook's Prophet         NaN                None                     1009771003677.5349
0  RNN                        NaN                None                     7283775674703.0820

Conclusion

Based on the ModelComparison table, the best model is SARIMAX with Grid Search, which has the lowest MSE. We use this model for the forecast in the next step.

Step 06 Two-Month Forecast

In [104]:
# Refit the best model (SARIMAX) on the full series, with the 10-year bond yield as the exogenous regressor
model = SARIMAX(y,exog=exog1,order=(2,1,2),seasonal_order=(2,0,1,12),enforce_invertibility=False)
results = model.fit(disp=0)


exog_forecast = market[['10yrBond']].loc['2017-12-31':]
fcast = results.predict(len(y)-1,
                        len(y)+2,
                        exog=exog_forecast[:3]).rename('SARIMAX(2, 1, 2) (2, 0, 1, 12) Forecast')
/Users/akiofukashima/opt/anaconda3/envs/mypython3/lib/python3.7/site-packages/statsmodels/base/model.py:568: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals
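An optional variant, not run here: results.get_forecast returns the same point forecasts together with confidence intervals, so the uncertainty around the two-month forecast could be plotted as well. A sketch, assuming exog_forecast starts at the first out-of-sample month, as the predict call above implies:

# Sketch: two-month forecast with confidence intervals (not part of the original run).
fc = results.get_forecast(steps=2, exog=exog_forecast[:2])
print(fc.predicted_mean)   # point forecasts
print(fc.conf_int())       # lower / upper bounds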

In [105]:
title='2 months Forecast'
xlabel='Month'
ylabel='Total sales'

ax = y.plot(legend=True,figsize=(16,6),title=title)
fcast.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

Save the data

In [106]:
market.index.names=['month']
In [107]:
datapath = '../data'
save_file(df, 'df_data_step3b_EDA.csv', datapath)
save_file(market.reset_index(), 'time_sales_data_step3b_EDA.csv', datapath)
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data/df_data_step3b_EDA.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data/time_sales_data_step3b_EDA.csv"